PROJECT 4 - RED WINE DATASET by JUAN SILVA

Data load and initial inspection

Let’s display summaries of the data set in several ways to get a sense of the data.

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
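The output above is consistent with base R inspection calls such as `dim()`, `names()`, `str()`, `head()` and `summary()`. Here is a minimal sketch that rebuilds only the six rows printed above, so it runs without the original file. The name `wine_head` is made up for illustration; in the actual analysis the data frame would presumably be read from the project's CSV file.

```r
# Six rows copied from the head() output above; `wine_head` is a hypothetical name.
wine_head <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

dim(wine_head)      # number of rows and columns
names(wine_head)    # column names
str(wine_head)      # type and first values of each column
head(wine_head)     # first rows
summary(wine_head)  # min, quartiles, mean and max per column
```

On the full data set the same calls would report 1599 rows and 13 columns, as shown above.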

Univariate Plots Section

Let’s first look at the distribution of each of the variables.

Some variables look close to normally distributed, others much less so. More discussion about this follows in the analysis.

For now it is clear that the Residual Sugar, Chlorides and Total Sulfur Dioxide distributions have very long tails. Let’s apply a log10 transformation to see how they look.
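To illustrate why a log10 scale helps with a long right tail, here is a small sketch using only the five-number summary of residual sugar from the `summary()` output above. The helper `tail_ratio` is a hypothetical name introduced here; it is not part of the original analysis.

```r
# Five-number summary of residual.sugar, copied from the summary() output above.
sugar <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)
log_sugar <- log10(sugar)

# How far the maximum sits above the median, relative to how far the
# minimum sits below it. A value near 1 would indicate a symmetric range.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

tail_ratio(sugar)      # raw scale: the upper tail dominates
tail_ratio(log_sugar)  # log10 scale: the upper tail is much less extreme
```

On the full data the same idea would apply column-wise, for example by plotting a histogram of `log10()` of the column instead of the raw values.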

Univariate Analysis

What is the structure of your dataset?

The data set is tidy. The first column holds a unique ID for each wine. The remaining variables are measurements of the chemical composition of the wine, except for the last one, which is the quality score assigned to the wine.

Quality is an integer which can take values from 0 to 10. None of the wines in the data set has a score lower than 3 or higher than 8. The quality scores look roughly normally distributed.

What is/are the main feature(s) of interest in your dataset?

The main feature is the quality. The main interest is in knowing which other variables are directly correlated with wine quality. It may also be of interest to see which variables are correlated with each other, independently of how they affect the quality.

In particular I am interested in Residual Sugar, Alcohol and Citric Acid, since from my perspective these are among the more palpable features that could affect a wine’s taste and perceived quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All other features could play a factor in the perceived quality of the wine.

Did you create any new variables from existing variables in the dataset?

Yes, I created a variable named “Category” which groups the quality scores into three values: High (scores 7 and 8), Medium (5 and 6), and Low (3 and 4). This groups the wines by quality level so that the statistics for the main features can be compared at that level of granularity.
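One way such a variable could be built in base R is with `cut()`. This is only a sketch of the binning described above, not necessarily the code used for this report.

```r
# Quality scores observed in the data set range from 3 to 8.
quality <- 3:8

# With the default right-closed intervals the bins are (2,4], (4,6], (6,8],
# so 3-4 -> Low, 5-6 -> Medium, 7-8 -> High.
category <- cut(quality,
                breaks = c(2, 4, 6, 8),
                labels = c("Low", "Medium", "High"))

data.frame(quality, category)
```

Because `cut()` returns a factor with the levels in the order given, later summaries and plots keep the Low, Medium, High ordering.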

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some of the variables have a clear normal distribution, such as Density and pH, as well as the acidity (volatile and fixed), although the latter two have a slightly long tail.

Chlorides and Residual Sugar have very long tails, which means that the majority of the wines have a low amount of salt and sugar respectively, and very few have high amounts. Sulfur dioxide behaves similarly. That makes sense, since you would want enough of it to prevent oxidation but not so much that it can be perceived in the taste.

Interesting cases to me are the alcohol and the citric acid. Alcohol levels are more widely spread, although it is still clear that fewer wines have higher amounts of alcohol. The clear peak is between 9 and 10 percent.

For Citric Acid we can almost observe three peaks: one close to 0 grams per liter, another around 0.25 and one more at 0.5. I wonder if those amounts are typical for wines defined by some other criteria.

The quality seems to have a typical normal distribution with a mean of 5.6 and a median of 6. I found it interesting that none of the wines were rated higher than 8 or lower than 3.

Bivariate Plots Section

I wonder how the distribution of some of these variables changes depending on the quality.

There seem to be a few differences in the distributions of Alcohol and Citric Acid depending on the quality.

Let’s subset the wines into quality groups and see the summary statistics for each of these features. For this I will add a “Category” column which groups the quality scores into three values: High (scores 7 and 8), Medium (5 and 6), and Low (3 and 4). Judging by the means quoted further down, the summaries below appear in the order High, Medium, Low, first for alcohol and then for citric acid.

##     alcohol     
##  Min.   : 9.20  
##  1st Qu.:10.80  
##  Median :11.60  
##  Mean   :11.52  
##  3rd Qu.:12.20  
##  Max.   :14.00
##     alcohol     
##  Min.   : 8.40  
##  1st Qu.: 9.50  
##  Median :10.00  
##  Mean   :10.25  
##  3rd Qu.:10.90  
##  Max.   :14.90
##     alcohol     
##  Min.   : 8.40  
##  1st Qu.: 9.60  
##  Median :10.00  
##  Mean   :10.22  
##  3rd Qu.:11.00  
##  Max.   :13.10

##   citric.acid    
##  Min.   :0.0000  
##  1st Qu.:0.3000  
##  Median :0.4000  
##  Mean   :0.3765  
##  3rd Qu.:0.4900  
##  Max.   :0.7600
##   citric.acid    
##  Min.   :0.0000  
##  1st Qu.:0.0900  
##  Median :0.2400  
##  Mean   :0.2583  
##  3rd Qu.:0.4000  
##  Max.   :0.7900
##   citric.acid    
##  Min.   :0.0000  
##  1st Qu.:0.0200  
##  Median :0.0800  
##  Mean   :0.1737  
##  3rd Qu.:0.2700  
##  Max.   :1.0000
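Per-category summaries like the ones above can be produced by splitting a column on the category factor. The sketch below uses only the first ten alcohol and quality values shown in the earlier `str()` output, so its numbers will not match the full-data summaries; the variable names are illustrative.

```r
# First ten alcohol and quality values, copied from the str() output earlier.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

category <- cut(quality, breaks = c(2, 4, 6, 8),
                labels = c("Low", "Medium", "High"))

# One summary() per category that actually occurs in these ten rows
# (there are no Low wines among them, so that level is dropped).
by_category <- lapply(split(alcohol, droplevels(category)), summary)
by_category
```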

Now that I have made groups I would like to see the distribution of the main variables that interest me, alcohol and citric acid.

It is hard to tell because there are only a few wines with low (3, 4) or high (7, 8) quality scores, but there seems to be a difference in the distribution of alcohol levels for wines graded higher than 5. Wines of higher quality tend to have higher levels of alcohol. Wines with quality 5 and 6 have a mean of 10.25%, while wines with quality 7 and 8 have a mean of 11.52%. This is also supported by the box plots, where we can see the median values for each of the three quality categories.

Sugar, on the other hand, seems to have the same distribution across wines of all qualities, peaking between two and three grams per liter.

The citric acid distribution seems about the same for wines with quality 5 and 6. For higher quality wines, however, the initial peak close to 0 is much reduced, leaving the majority of wines with amounts between 0.25 and 0.50 grams per liter. Wines with quality 5-6 have a mean of 0.26, while wines with quality 7-8 have a mean of 0.38. Again this is depicted by the box plot and the median values.

Now let’s inspect the correlations between these variables.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
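A matrix like the one above is what `cor()` returns for a numeric data frame, using Pearson correlation by default. The sketch below runs it on four columns and the ten rows visible in the earlier `str()` output, so the coefficients differ from the full-data values; `sample_rows` is a hypothetical name.

```r
# First ten values of four columns, copied from the str() output earlier.
sample_rows <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations; the result is symmetric with 1s on the diagonal.
r <- cor(sample_rows)
round(r, 2)
```

For the full data set the ID column `X` would be left out first, since it carries no chemical information.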

At first glance there do not seem to be any very strong correlations between variables, and certainly not between quality and any other variable. But let’s look at scatter plots of the strongest relationships between quality and the other variables.

There are clear and obvious vertical lines for quality since it is a discrete variable.

Let’s try looking at the mean for each quality score, and plot it for the other variables that correlate most strongly with quality.
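Computing a mean per quality score can be sketched with `aggregate()`. The example again uses only the ten rows visible in the `str()` output, so the values are illustrative rather than the ones plotted in this report.

```r
# First ten alcohol and quality values, copied from the str() output earlier.
d <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One row per quality score with the mean alcohol level for that score.
mean_by_score <- aggregate(alcohol ~ quality, data = d, FUN = mean)
mean_by_score
```

Plotting these means against the score gives a single point per quality level, which avoids the overplotted vertical stripes of the raw scatter plot.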

Besides the correlations between Quality and those variables, there are in fact stronger correlations among the other variables, which were of less interest. Some are positive and some are negative.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There aren’t any strong correlations (>= 0.7) between quality and the other features, but Quality seems to be moderately correlated with some of them, in particular with Alcohol. I more or less expected that correlation to be there, and the same goes for citric acid. As the quality increases so do the levels of Alcohol and Citric Acid. Although this does not mean that those substances increase the quality of the wine, I would expect some relationship since alcohol and the freshness given by citric acid are essential parts of the flavour of a wine.

Average alcohol levels start at around 10% for lower quality wines and reach around 12% for the highest rated wines. I think the slight increase and then decrease observed for wines with a quality score of 4 accounts for the rather moderate correlation.

Citric acid goes up with quality, from around 0.2 to 0.37. There are no downward trends, but from quality 5 to 8 the slope is a bit less steep than for alcohol.

Sulphates have a low positive correlation with quality (0.25), while volatile acidity has a negative correlation (-0.39), with a clear decline of the volatile acidity level as wine quality goes from 3 to 7.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

An interesting, albeit expected, strong negative correlation (-0.68) is the one between pH and Fixed Acidity. As the pH increases the wine becomes less acidic, so the measured acidity goes down.

Free and Total sulfur dioxide are positively correlated (0.67), which is also expected as the free form is a subset of the total sulfur dioxide.

Citric Acid has an interesting relationship with the acidity levels. It has a positive correlation with fixed acidity (0.67) and a negative correlation with volatile acidity (-0.55). This is explained by this Wikipedia article: https://en.wikipedia.org/wiki/Acids_in_wine#Citric_acid, which clarifies that citric acid is in fact a fixed acid. I speculate that as wine makers add more fixed acids they reduce the volatile acids. This is somewhat supported by the negative correlation (-0.26) between these two types of acids.

What was the strongest relationship you found?

The strongest relationship was between pH and Fixed Acidity (-0.68297819), but as explained before that is just because of the nature of those variables. The strongest correlation involving Quality was with Alcohol, at 0.47616632. I was surprised at the low correlation between Residual Sugar and Quality, which was only 0.01373164.
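Picking the strongest pair out of a correlation matrix can also be done programmatically. The sketch below rebuilds a five-variable subset of the matrix printed earlier (values copied from that output) and looks for the largest absolute off-diagonal entry; all object names are illustrative.

```r
vars <- c("fixed.acidity", "citric.acid", "pH", "alcohol", "quality")
r <- diag(1, length(vars))
dimnames(r) <- list(vars, vars)

# Upper triangle in column-major order, copied from the matrix printed earlier.
r[upper.tri(r)] <- c(
  0.67170343,
  -0.68297819, -0.54190414,
  -0.06166827, 0.10990325, 0.20563251,
  0.12405165, 0.22637251, -0.05773139, 0.47616632
)
r[lower.tri(r)] <- t(r)[lower.tri(r)]  # mirror to make the matrix symmetric

# Strongest pair overall: ignore the diagonal and compare absolute values.
off_diag <- abs(r)
diag(off_diag) <- 0
idx <- which(off_diag == max(off_diag), arr.ind = TRUE)[1, ]
strongest_pair <- c(rownames(r)[idx["row"]], colnames(r)[idx["col"]])
strongest_pair

# Variable most strongly correlated with quality.
with_quality <- r["quality", vars != "quality"]
best_for_quality <- names(with_quality)[which.max(abs(with_quality))]
best_for_quality
```

Comparing absolute values matters here, because the strongest relationship in this subset is a negative one.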

Multivariate Plots Section

I am mostly interested in seeing the same correlated variables by wine quality category to see if there are any big differences.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For most of the correlated variables we can observe that the same mean value trends are found across all wine quality categories.

Perhaps one exception is the High and Low quality wines when comparing Total vs Free sulfur dioxide, where the category means are somewhat far from the overall mean. I think just a few wines in those categories have atypical amounts of sulfur dioxide. Looking at the scatter points behind the lines, we can see that it is mostly lone wines far from the average that pull the mean for that category.

Were there any interesting or surprising interactions between features?

Not particularly. They seem to follow the trend that was already seen when looking at just two variables.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

None at this time.


Final Plots and Summary

Plot One

Description One

Distribution of Alcohol, the variable most correlated with quality. We define wine quality categories of Low (scores 3 and 4), Medium (scores 5 and 6) and High (scores 7 and 8).

Most of the rated wines fall in the Medium quality category. Looking at the overall distribution, low quality wines have alcohol levels on the lower side of the scale, while high quality wines have higher counts at the higher end of the scale.

Plot Two

Description Two

We focus on Alcohol, the variable most correlated with the wine quality scores. We define wine quality categories of Low (scores 3 and 4), Medium (scores 5 and 6) and High (scores 7 and 8).

We can observe a tendency for higher quality wines to have higher levels of alcohol. An observable difference is found especially between medium and high quality wines. The mean level of alcohol increases from just under 10% to slightly above 12% between scores 5 and 8, while the median increases from 10% to 11.6% between Medium and High quality wines.

Plot Three

Description Three

We analyze the change in the average Citric Acid level as alcohol levels increase. Generally, Medium quality wines follow the same trend as the overall average. High quality wines have considerably less citric acid when the alcohol rises above 12%. Low quality wines seem to have fairly low levels of citric acid overall, with a downward trend up to 11% alcohol and a slight increase afterwards as alcohol increases.


Reflection

The objective was to find the variables that may have an impact on, or be tightly correlated with, wine quality. I was expecting higher correlations with quality than the ones I found. Looking at the correlation matrix it was clear that there were no such high correlations.

Fortunately, Alcohol and Citric Acid, while not highly correlated with quality, do show a trend that suggests higher levels of both in higher quality wines.

Looking at correlation between other variables I was only able to understand how those substances or variables interact with each other and affect each other. But it was difficult to establish how any one of them or combination of them may have an impact on quality.

It would be interesting to explore the data further if it had more categorical variables. Things like the region these wines come from, the weather conditions in those regions, and grape growing variables such as humidity could enrich the data and the analysis, and help us understand more about what goes into crafting a good bottle of wine.